The Dataset Collection

The Dataset Collection consists of large data archives from both sites and individuals.

492 results


The Dataset Collection
data
eye 30,595
favorite 6
comment 3
(Here is the original Reddit comment announcing this collection of data and what the processes were.) This is an archive of Reddit comments from October of 2007 until May of 2015 (complete month). This reflects 14 months of work and a lot of API calls. This dataset includes nearly every publicly available Reddit comment. Approximately 350,000 comments out of ~1.65 billion were unavailable due to Reddit API issues. Q: How are the files structured? Each file is compressed with bzip2 compression....
Rating: 5 of 5 stars (3 reviews)
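As a rough illustration of working with bzip2-compressed files like those described above, the Python sketch below streams one file and parses each line as JSON. The one-object-per-line layout, the file name, and the `subreddit` and `body` fields are assumptions made for illustration; the truncated description does not confirm them.

```python
import bz2
import json
import os
import tempfile


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a bzip2 file."""
    # bz2.open in text mode decompresses lazily, so the whole file is
    # never held in memory at once.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny stand-in file so the sketch runs without the real archive.
# These records are made up; they are not taken from the dataset.
sample = [
    {"subreddit": "example", "body": "first comment"},
    {"subreddit": "example", "body": "second comment"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_comments.bz2")
with bz2.open(sample_path, mode="wt", encoding="utf-8") as handle:
    for record in sample:
        handle.write(json.dumps(record) + "\n")

records = list(iter_records(sample_path))
print(len(records))
```

Streaming line by line matters at this scale: with well over a billion comments, loading a whole month into memory before parsing is unlikely to be practical.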
The Dataset Collection
by Gwern Branwen
data
eye 24,743
favorite 3
comment 0
Dark Net Markets (DNM) are online markets typically hosted as Tor hidden services whose users transact in Bitcoin or other cryptocoins, usually for drugs or other illegal/regulated goods; the most famous DNM was Silk Road 1, which pioneered the business model. From 2013-2015, I scraped/mirrored on a weekly or daily basis all existing English-language DNMs as part of my research into their usage, lifetimes/characteristics, & legal riskiness; in addition, I made or obtained copies of as many...
Topics: Tor, Bitcoin, drugs, Silk Road, Evolution, Agora, black-markets, dark net markets
MusicBrainz Data Dumps
collection
396 items
eye 12,406
The MusicBrainz Database is built on the PostgreSQL relational database engine and contains all of MusicBrainz' music metadata. This data includes information about artists, release groups, releases, recordings, works, and labels, as well as the many relationships between them. The database also contains a full history of all the changes that the MusicBrainz community has made to the data. Core data. Artists: name, sort name, IPI, aliases, type, begin and end dates, disambiguation comment, MBID...
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
collection
47 items
eye 11,891
This is an unofficial mirror of the DISCOGS.ORG data collection, which is located at http://www.discogs.com/data/ . Discogs, short for discographies, is a website and database of information about audio recordings, including commercial releases, promotional releases, and bootleg or off-label releases. The Discogs servers, currently hosted under the domain name discogs.com, are owned by Zink Media, Inc., and are located in Portland, Oregon, USA. Discogs is one of the largest online databases of...
The Dataset Collection
by Internet Archive
data
eye 11,503
favorite 3
comment 1
Culled from various sources, this collection includes over one million JPG, PNG and GIF album covers. The resolution ranges from "thumbnail" through to very large sizes. Filenames vary in usefulness, although a good number indicate at least the name of the original album. This dataset is for experimentation and image processing research only. At 148 GB, the collection is large but not unmanageable (there is a torrent available) and allows a developer or artist to work with the...
Rating: 5 of 5 stars (1 review)
Topics: dataset, big data, album covers, covers, cover art, cover photos
The Dataset Collection
by NYC Taxi and Limousine Commission
data
eye 6,328
favorite 2
comment 0
FOIA/FOILed taxi trip data from the NYC Taxi and Limousine Commission, 2013. Released by http://chriswhong.com/open-data/foil_nyc_taxi/ . The trip_data.7z and trip_fare.7z files are more efficiently compressed versions of the data; you probably want these files. The data is in CSV format. The data files include the fields: medallion, hack_license, vendor_id, rate_code, store_and_fwd_flag, pickup_datetime, dropoff_datetime, passenger_count, trip_time_in_secs, trip_distance, pickup_longitude,...
Topics: data, nyc, taxi, fare, csv, FOIA, FOIL
Source: torrent:urn:sha1:6c594866904494b06aae51ad97ec7f985059b135
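The field list in the taxi description above suggests a simple way to read the trip files. The Python sketch below is a minimal illustration that assumes a header row using exactly the field names quoted in the description (the list there is truncated, so only the named fields appear). The sample rows are invented stand-ins, not real trip records.

```python
import csv
import io

# Column names as quoted in the item description; the original list is
# truncated, so later columns are deliberately left out.
FIELDS = [
    "medallion", "hack_license", "vendor_id", "rate_code",
    "store_and_fwd_flag", "pickup_datetime", "dropoff_datetime",
    "passenger_count", "trip_time_in_secs", "trip_distance",
]

# Invented rows standing in for the extracted trip data (assumption).
SAMPLE_CSV = "\n".join([
    ",".join(FIELDS),
    "M1,H1,VND,1,N,2013-01-01 00:00:00,2013-01-01 00:10:00,2,600,2.5",
    "M2,H2,VND,1,N,2013-01-01 01:00:00,2013-01-01 01:20:00,1,1200,5.5",
])


def summarize(handle):
    """Count trips and total up distance and duration from a CSV stream."""
    trips = 0
    total_distance = 0.0
    total_seconds = 0
    for row in csv.DictReader(handle):
        trips += 1
        total_distance += float(row["trip_distance"])
        total_seconds += int(row["trip_time_in_secs"])
    mean_distance = total_distance / trips if trips else 0.0
    return {
        "trips": trips,
        "mean_distance": mean_distance,
        "total_seconds": total_seconds,
    }


summary = summarize(io.StringIO(SAMPLE_CSV))
print(summary)
```

For the real files, the same function could be given an open file handle instead of the in-memory sample, provided the header matches the assumption above.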
Internet Census 2012
by Anonymous
collection
15 items
eye 4,403
Abstract: While playing around with the Nmap Scripting Engine (NSE) we discovered an amazing number of open embedded devices on the Internet. Many of them are based on Linux and allow login to standard BusyBox with empty or default credentials. We used these devices to build a distributed port scanner to scan all IPv4 addresses. These scans include service probes for the most common ports, ICMP ping, reverse DNS and SYN scans. We analyzed some of the data to get an estimation of the IP address...
The Dataset Collection
data
eye 3,628
favorite 1
comment 0
I took the Reddit comment archive and converted all the JSON into one SQLite database using this program that I wrote: https://gist.github.com/ers35/3b615a75fa0ed5e6d5cc . I ran a few tests to make sure the number of database rows matches the number of JSON records. "SELECT MAX(rowid) FROM comment" and "SELECT COUNT(id) FROM comment" both return 1659361605. This gives me some confidence as to the integrity of the dataset, but I cannot be 100% sure. The compressed size is 163 GB....
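The row-count check quoted in this description can be reproduced in a few lines. The Python sketch below runs the same two queries against a small in-memory database. Only the `comment` table name and its `id` column come from the quoted queries; the `body` column and the sample rows are hypothetical.

```python
import sqlite3


def row_counts(conn):
    """Return (MAX(rowid), COUNT(id)) for the comment table."""
    max_rowid = conn.execute("SELECT MAX(rowid) FROM comment").fetchone()[0]
    count_id = conn.execute("SELECT COUNT(id) FROM comment").fetchone()[0]
    return max_rowid, count_id


# Stand-in database: the schema beyond `id` is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comment (id TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO comment (id, body) VALUES (?, ?)",
    [("c1", "first"), ("c2", "second"), ("c3", "third")],
)

max_rowid, count_id = row_counts(conn)
# The two values agree when rowids were assigned sequentially from 1,
# no rows were deleted, and no id is NULL.
print(max_rowid == count_id)
```

Agreement between the two numbers shows only that no rows are missing from the sequence; as the description itself notes, it does not prove that the content of each row is intact.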
The Dataset Collection
data
eye 1,699
favorite 0
comment 0
Large sets of malware examples for the purposes of research, comparison, and history. This is the Various set, which is a volume of specific smaller sets of malware.
The Dataset Collection
software
eye 1,605
favorite 0
comment 0
All the "journal article" DOIs from CrossRef's OAI-PMH server; URLs of just under 50 million journal articles.
Topics: doi, dataset
The Dataset Collection
data
eye 1,414
favorite 0
comment 0
Database of UPC product codes, as compiled by upcdatabase.com
Topics: UPC, Universal Product Code, barcode
The Dataset Collection
data
eye 828
favorite 0
comment 0
Database of UPC product codes, as compiled by upcdatabase.com
Topics: UPC, Universal Product Code, barcode
The Dataset Collection
data
eye 752
favorite 0
comment 0
A collection of fanfiction stories from fanfiction.net, repacked for easier bulk collecting and archiving. Contains many tens of thousands of fan fiction stories.
The Dataset Collection
data
eye 608
favorite 0
comment 0
Minecraft Archive Project: MINECRAFTFORUM.NET
The Dataset Collection
by William W. Cohen, MLD, CMU
web
eye 480
favorite 2
comment 0
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of...
Topics: Enron, E-mail, Dataset
The Dataset Collection
data
eye 427
favorite 0
comment 0
Database of UPC product codes, as compiled by upcdatabase.com http://sourceforge.net/projects/upcdatabase/
Topics: UPC, Universal Product Code, barcode
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 378
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20130103 and the data is from 2013-01-03.
The Dataset Collection
by Weiwei Zhang, Jian Sun, and Xiaoou Tang
data
eye 365
favorite 0
comment 0
This dataset is mirrored from http://137.189.35.203/WebUI/CatDatabase/catData.html, which circa May 2017 is a dead link. The original page is available in the Wayback Machine: https://web.archive.org/web/20150520175645/http://137.189.35.203/WebUI/CatDatabase/catData.html The CAT dataset includes 10,000 cat images. For each image, we annotate the head of the cat with nine points: two for the eyes, one for the mouth, and six for the ears. The detailed configuration of the annotation was shown in Figure 6 of the original paper:...
Topics: cats, datasets, computer vision
The Dataset Collection
data
eye 342
favorite 0
comment 0
CITY OF SCREENSHOTS is a dataset of automatically screenshotted computer programs taken from the uploaded items in the Internet Archive's software collection. An ever-growing dataset, it currently contains over 450,000 individual screenshot images from over 60,000 software programs for the Apple II, Atari 8-Bit, MS-DOS, Windows, ZX Spectrum and Tandy Color Computer. Screenshots were generated by automated script: An instance of Firefox would be spun up and aimed at the emulated item on the...
Topics: screenshots, computers, screens, artwork
Academic Torrents
by ACADEMICTORRENTS.COM
collection
0 items
eye 257
Welcome to Academic Torrents! Making 14.15TB of research data available. We've designed a distributed system for sharing enormous datasets - for researchers, by researchers. The result is a scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds.
MusicBrainz Data Dumps
data
eye 249
favorite 0
comment 0
MusicBrainz Database Dump 20141004-015144
Internet Census 2012
software
eye 218
favorite 0
comment 0
Internet Census 2012
software
eye 214
favorite 0
comment 0
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 162
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20110601 and the data is from 2011-06-01.
The Dataset Collection
by GeoNames Contributors
data
eye 120
favorite 0
comment 0
This is a complete copy of the GeoNames database dumps fetched on 2017-06-13 from http://www.geonames.org/export/. It does not include any postal code information. See the readme.txt file for more information. "The GeoNames geographical database covers all countries and contains over eleven million placenames that are available for download free of charge."
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 105
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20130301 and the data is from 2013-03-01.
Internet Census 2012
software
eye 102
favorite 0
comment 0
Internet Census 2012
software
eye 87
favorite 0
comment 0
The Dataset Collection
data
eye 84
favorite 0
comment 0
Manifest of the Internet Archive's identified scholarly works in digital form (e.g., journal articles). See README.html for details.
Internet Census 2012
software
eye 83
favorite 0
comment 0
MusicBrainz Data Dumps
data
eye 81
favorite 0
comment 0
MusicBrainz Database Dump 20150523-002918
The Dataset Collection
by library.link
data
eye 81
favorite 0
comment 0
This item contains a snapshot of library holdings by ISBN, as provided by http://library.link (Zepheira Technologies). A seed list of 436 domains was fetched from http://library.link/harvest/sitemap.xml, and for each domain found the file `http://DOMAIN/id/isbn/all/` was fetched. When run on 2017-06-13, all files except that at `link.library.brandeis.edu` fetched successfully, and are in `isbn_html.zip`. ISBN numbers were then extracted into raw lists by domain (`isbn_lists.zip`), and some...
Internet Census 2012
software
eye 80
favorite 0
comment 0
Internet Census 2012
software
eye 79
favorite 0
comment 0
Internet Census 2012
software
eye 74
favorite 0
comment 0
MusicBrainz Data Dumps
data
eye 72
favorite 0
comment 0
MusicBrainz Database Dump 20140924-014210
Internet Census 2012
software
eye 66
favorite 0
comment 0
Internet Census 2012
software
eye 65
favorite 0
comment 0
Internet Census 2012
software
eye 65
favorite 0
comment 0
MusicBrainz Data Dumps
data
eye 65
favorite 0
comment 0
MusicBrainz Database Dump 20140215-002226
MusicBrainz Data Dumps
data
eye 65
favorite 0
comment 0
MusicBrainz Database Dump 20140329-003047
Internet Census 2012
software
eye 65
favorite 0
comment 0
Internet Census 2012
software
eye 62
favorite 0
comment 0
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 62
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. The data is in XML format and formatted according to the API spec: http://www.discogs.com/developers/. This dump has been generated and archived automatically. Official name is discogs_20080309 and the data is from 2008-03-09.
MusicBrainz Data Dumps
data
eye 56
favorite 0
comment 0
MusicBrainz Data Dump 20120623-002010
Topics: database, music, musicbrainz
Internet Census 2012
software
eye 56
favorite 0
comment 0
Internet Census 2012
software
eye 55
favorite 0
comment 0
MusicBrainz Data Dumps
data
eye 54
favorite 0
comment 0
MusicBrainz Database Dump 20150408-002807
MusicBrainz Data Dumps
data
eye 49
favorite 0
comment 0
MusicBrainz Database Dump 20140705-002713
The Dataset Collection
data
eye 49
favorite 0
comment 1
Large collection of Minecraft modifications. The files directory is packed into a files.zip archive for ease of transfer, but should be unpacked before use.
Rating: 4 of 5 stars (1 review)
The Dataset Collection
by Microsoft Academic Search
data
eye 48
favorite 0
comment 0
This is a copy of the Microsoft Academic Graph corpus of scholarly publications and citations, based on crawls from the open web. Metadata (authors, DOI numbers, journals, citations, keywords, affiliations, etc) is included for more than 125 million publications. The corpus is a single 27GB zipfile that extracts into about 96GB of flat tab-separated text files, cross-referenced using identifier columns. Schema information can be found in the `readme.txt` file, and usage restrictions can be...
MusicBrainz Data Dumps
data
eye 43
favorite 1
comment 0
MusicBrainz Database Dump 20150124-024159
MusicBrainz Data Dumps
data
eye 39
favorite 0
comment 0
MusicBrainz Database Dump 20140716-002400
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 39
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20121101 and the data is from 2012-11-01.
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 39
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20081014 and the data is from 2008-10-14.
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 38
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20110107 and the data is from 2011-01-07.
MusicBrainz Data Dumps
data
eye 37
favorite 0
comment 0
MusicBrainz Database Dump 20140806-002953
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 35
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20120601 and the data is from 2012-06-01.
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 35
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20090701 and the data is from 2009-07-01.
The Dataset Collection
data
eye 35
favorite 0
comment 0
Names and other information about the individuals held in the ten WRA camps during the WWII incarceration. Imported from dat://9df6c69a6337cb24d6a45f8a71364f8e58fa37608f14ca37fec743a856b3ed97
MusicBrainz Data Dumps
data
eye 33
favorite 0
comment 0
MusicBrainz Database Dump 20140611-002358
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 33
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20130101 and the data is from 2013-01-01.
MusicBrainz Data Dump 20111231-001743 (partial, no edit data included)
MusicBrainz Data Dumps
data
eye 31
favorite 0
comment 0
MusicBrainz Database Dump: 20131012-002238
MusicBrainz Data Dumps
data
eye 30
favorite 0
comment 0
MusicBrainz Database Dump: 20130612-002814
MusicBrainz Data Dumps
data
eye 29
favorite 0
comment 0
MusicBrainz Database Dump 20140719-003856
MusicBrainz Data Dumps
data
eye 29
favorite 0
comment 0
MusicBrainz Database Dump 20141217-015115
MusicBrainz Data Dumps
data
eye 29
favorite 0
comment 0
MusicBrainz Database Dump 20140219-002209
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 29
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20100601 and the data is from 2010-06-01.
MusicBrainz Data Dumps
data
eye 29
favorite 0
comment 0
MusicBrainz Database Dump 20131116-002106
MusicBrainz Data Dumps
data
eye 29
favorite 0
comment 0
MusicBrainz Database Dump 20140809-003420
MusicBrainz Data Dumps
data
eye 28
favorite 0
comment 0
MusicBrainz Database Dump 20140726-002617
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 28
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20090901 and the data is from 2009-09-01.
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 28
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20090304 and the data is from 2009-03-04.
MusicBrainz Data Dumps
data
eye 28
favorite 0
comment 0
MusicBrainz Database Dump 20140319-002202